We work for a "Big Five" Hollywood movie studio, and we're deciding which movies to make this year.
But the guys upstairs don't want to go through the hassle of reading a bunch of scripts and sitting through a lot of meetings.
They just want to know which genre has the best chance of succeeding.
That's where we come in.
Our goal, should we choose to accept it, or be forced into it by powerful people who write our checks, is to analyze domestic movies by genre and determine which ones seem the most promising.
So let's get to it!
We'll be using the movie data from the Data_Cleaning notebook. If you haven't had a chance to check it out, it's a collection of movies from The Numbers and Box Office Mojo, two great sources of movie information. As a reminder, all monetary data has been converted to 2018 dollars to normalize them.
We only care about the domestic box office for this study. Yes, increasingly movies are being released worldwide, but we want to first judge a movie's chances in the domestic market. Who knows, maybe a worldwide analysis will take place at another time (or notebook... in this repository... nudge nudge).
Here are the steps we will use to accomplish our task:
Decide which genres to consider in our analysis.
Decide on a profitability measure.
Whittle our dataset down to movies released by the Big Five. This is to control for factors like lack of budget or marketing affecting a movie's success. Sure, independent movies can succeed, but our bosses want to see how the genres performed with all the help of a studio behind them.
Analyze the historical performance of the genres by decade. There might be some trends over time that would be useful to know.
Analyze the historical performance of the genres by release week. Maybe certain genres perform better at certain times of year.
Hopefully, give our bosses actionable insight!
According to The Numbers, the top six genres in terms of box office gross are Action, Adventure, Comedy, Drama, Horror, and Thriller/Suspense.
Our bosses like making money.
Sold.
The toughest question out there. Hollywood studios are notorious for phony accounting, cough cough Harry Potter and the Order of the Phoenix.
Not our bosses though. They're great. Real... stand up... people.
Anyway, we must decide on a measure for how "successful" a movie is.
For simplicity, we will judge a movie as a success if it breaks even. The reason is that there are additional revenue streams for movies beyond the theaters. There are TV airings, rentals, DVD sales, merchandise, theme park rides, spinoff TV shows, sequels, etc.
Well, there's the production budget. We can either get that information or we can't.
Then there's the second biggest expense: marketing. But it's difficult to know the marketing costs of movies. An article at How Stuff Works states that marketing spend is typically around 50% of a movie's production budget.
According to this article on The Week, movie studios only end up with about 50% of the total domestic box office. Movie theaters get the other half.
The breakeven point is where total earnings equal total expenses: (Domestic Box Office / 2) = (1.5 * Production Budget)
Or, to simplify: Domestic Box Office = 3 * Production Budget
At the point where the domestic box office has earned three times the original production budget (or two times the marketing-adjusted budget, which is 1.5 times the production budget), we shall say the movie has broken even.
This explains why we have created a column called domestic_breakeven that performs this calculation to classify our movies.
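To make the breakeven rule concrete, here is a minimal sketch of the same check for a single movie. The function name `breaks_even` and its `studio_share` and `marketing_ratio` parameters are made up for illustration; they simply encode the 50% theater split and the 50% marketing assumption described above.

```python
def breaks_even(domestic_gross, production_budget,
                studio_share=0.5, marketing_ratio=0.5):
    """Return True when the studio's share of the gross covers budget plus marketing."""
    # Earnings: the studio keeps roughly half of the domestic box office
    earnings = domestic_gross * studio_share
    # Expenses: production budget plus marketing assumed at 50% of the budget
    expenses = production_budget * (1 + marketing_ratio)
    return earnings >= expenses


# A hypothetical $100M production needs a $300M domestic gross to break even
print(breaks_even(300_000_000, 100_000_000))  # True
print(breaks_even(299_000_000, 100_000_000))  # False
```

With the default assumptions this is the same test as Domestic Box Office >= 3 * Production Budget.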
We derive our profits equation from the breakeven equation, as profits are what remain after subtracting expenses from earnings.
Profit = (Domestic Box Office / 2) - (1.5 * Production Budget)
If the result is 0, the movie broke even.
If the result is positive, the movie made money.
If the result is negative, the movie lost money.
This explains why we have created a column called profit that performs this calculation to determine the amount of profit/loss each movie made.
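As a quick sanity check of the profit equation, here is a small sketch that runs it on three hypothetical movies (figures in millions of dollars, invented for illustration). The helper name `domestic_profit` is not part of the notebook.

```python
def domestic_profit(domestic_gross, production_budget):
    """Studio's half of the gross minus budget plus assumed marketing (1.5x budget)."""
    return (domestic_gross / 2) - (1.5 * production_budget)


# Hypothetical (gross, budget) pairs in millions of dollars
for gross, budget in [(400, 100), (300, 100), (100, 100)]:
    profit = domestic_profit(gross, budget)
    if profit > 0:
        outcome = 'made money'
    elif profit < 0:
        outcome = 'lost money'
    else:
        outcome = 'broke even'
    print('Gross {}, budget {}: profit {} ({})'.format(gross, budget, profit, outcome))
```

The three cases cover each branch of the interpretation above: a positive result, exactly zero, and a negative result.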
In this section of the notebook, we import our libraries, load the data, and filter it down to the movies we want to analyze.
We import a few libraries and set some global Jupyter notebook settings.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# For creating colormaps
import matplotlib.cm as cm
plt.style.use('fivethirtyeight')
from IPython.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))
pd.options.display.max_rows = 400
pd.options.display.max_columns = 50
We import the data and create a few columns we will use in our analysis.
Columns we are adding:
release_decade - We calculate the decade a movie was released using its release year.
domestic_breakeven - This is a boolean column (i.e. either True or False) based on whether the movie broke even according to our Profitability Equation.
profit - This is a numerical column that calculates the amount of profit a movie earned. We take the amount of domestic box office dollars it earned, divide it by 2, and then subtract 1.5 times its production budget. Any money left over is profit.
action, adventure, comedy, drama, horror, thriller_suspense - These columns are boolean columns (i.e. either True or False) that convey whether the movie in question is of the corresponding genre. For example, if a movie has the genre Comedy Drama, it will have a True value in both the comedy and drama column. This is very useful for separating our dataset by genre for graphing purposes.
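Here is a small sketch of how these derived columns behave on a toy dataframe. The three rows are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy rows (hypothetical movies) to illustrate the derived columns
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres_mojo': ['Comedy Drama', 'Horror', None],
    'release_year': [1994, 2008, 2017],
})

# Floor the year to its decade: 1994 -> 1990
toy['release_decade'] = toy['release_year'] // 10 * 10

# One boolean column per genre; na=False treats a missing genre as "not this genre"
for genre in ['comedy', 'drama', 'horror']:
    toy[genre] = toy['genres_mojo'].str.contains(genre.title(), na=False)

print(toy[['title', 'release_decade', 'comedy', 'drama', 'horror']])
```

Movie A is flagged as both a comedy and a drama, which is exactly the double counting we want when we split the dataset by genre.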
data = pd.read_csv('cleaned_movie_data.csv', parse_dates=['release_date'], usecols=['title', 'distributor_mojo', 'domestic_adj', 'budget_adj', 'genres_mojo', 'release_year', 'release_week', 'release_date'])
# Only look at movies that made money domestically
# The parentheses matter: & binds more tightly than >, so the comparison must be grouped
data = data[data['domestic_adj'].notna() & (data['domestic_adj'] > 0)]
# Only look at movies with budget information
data = data[data['budget_adj'].notna()]
# For decade analysis
data['release_decade'] = data['release_year'].apply(lambda x: x // 10 * 10)
# For breakeven analysis
data['domestic_breakeven'] = data['domestic_adj'] >= 3 * data['budget_adj']
# For profit analysis
data['profit'] = (data['domestic_adj'] / 2) - (1.5 * data['budget_adj'])
# Create columns for genres
# A movie can have multiple genres. If so, we will count it for every genre it's classified with.
data['action'] = data['genres_mojo'].str.contains('Action', na=False)
data['adventure'] = data['genres_mojo'].str.contains('Adventure', na=False)
data['comedy'] = data['genres_mojo'].str.contains('Comedy', na=False)
data['drama'] = data['genres_mojo'].str.contains('Drama', na=False)
data['horror'] = data['genres_mojo'].str.contains('Horror', na=False)
data['thriller_suspense'] = data['genres_mojo'].str.contains('Thriller|Suspense', na=False, regex=True)
# Remove rows that don't contain one of our genres
data = data[data['action'] | data['adventure'] | data['comedy'] | data['drama'] | data['horror'] | data['thriller_suspense']]
# Create dataframes for the genres
#action = data[data['genres_mojo'].str.contains('Action', na=False)]
#adventure = data[data['genres_mojo'].str.contains('Adventure', na=False)]
#comedy = data[data['genres_mojo'].str.contains('Comedy', na=False)]
#drama = data[data['genres_mojo'].str.contains('Drama', na=False)]
#horror = data[data['genres_mojo'].str.contains('Horror', na=False)]
#thriller_suspense = data[data['genres_mojo'].str.contains('Thriller|Suspense', na=False, regex=True)]
data.info()
The Big Five studios we will use in our analysis are Universal, Paramount, Warner, Disney, and Sony.
Studios have come and gone a lot historically. They get bought out by competitors, or go out of business. A lot of messy stuff.
To simplify, we will categorize a movie by its current studio owner. So for example, Disney recently purchased 20th Century Fox. So we will categorize a 20th Century Fox movie as Disney.
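The idea of rolling a distributor up to its current owner can be sketched with a couple of short regex patterns. The patterns and distributor names below are heavily abbreviated, hypothetical stand-ins and not the full lists used in this notebook.

```python
import pandas as pd

# Hypothetical, abbreviated patterns: parent studio -> regex of labels it now owns
parent_patterns = {
    'Disney': r'Walt Disney|Buena Vista|^Fox$|Fox Searchlight',
    'Warner': r'Warner Bros\.|New Line',
}

distributors = pd.Series(['Fox', 'Fox Searchlight', 'New Line', 'Lionsgate', None])

# Start with an empty object column, then fill in the parent wherever a pattern matches
parent = pd.Series([None] * len(distributors), index=distributors.index, dtype='object')
for name, pattern in parent_patterns.items():
    matches = distributors.str.contains(pattern, na=False, regex=True)
    parent[matches] = name

print(parent.tolist())
```

Distributors that match no pattern (or are missing) stay empty, so they can be dropped later with `notna()`.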
data[data['budget_adj'].notna() & data['distributor_mojo'].notna()]['distributor_mojo'].value_counts()
# Create a regex string to combine movies into their respective distributor
# https://en.wikipedia.org/wiki/Major_film_studio#Past
nbcuniversal = 'Universal|Focus Features|Focus World|Gramercy|Working Title|Big Idea|DreamWorks$|Illumination|Carnival|Mac Guff|United International'
print(data[data['distributor_mojo'].str.contains(nbcuniversal, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(nbcuniversal, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
viacom = 'Paramount|BET|Comedy Central|MTV|Nickelodeon|Bardel Entertainment|MTV Animation|Nickelodeon Animation Studio|Awesomeness|CMT|Melange|United International Pictures|VH1|Viacom 18 Motion Pictures'
print(data[data['distributor_mojo'].str.contains(viacom, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(viacom, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
warnermedia = 'Warner Bros.|CNN Films|HBO|DC Films|New Line|Cartoon Network Studios|Wang Film Productions|Adult Swim Films|Castle Rock Entertainment|Cinemax|Flagship|Fullscreen|Hello Sunshine|Spyglass'
print(data[data['distributor_mojo'].str.contains(warnermedia, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(warnermedia, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
disney = 'Walt Disney|^Fox$|Fox Atomic|A&E|Disneynature|ESPN|Fox Searchlight|Hulu|National Geographic|VICE|Fox Family|Lucasfilm|Marvel|The Muppets Studio|UTV Motion Pictures|20th Century Fox Animation|Blue Sky Studios|Lucasfilm Animation|Marvel Animation|Pixar Animation Studios|Buena Vista|Disney|Dragonfly Film Productions|Fox Star Studios|Fox Studios Australia|Kudos Film|New Regency|Patagonik Film Group|Shine Group|Tiger Aspect Productions|Zero Day Fox'
print(data[data['distributor_mojo'].str.contains(disney, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(disney, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
sony = 'Sony|Columbia|Affirm|Screen Gems|Stage 6|Ghost Corps|Funimation|Madhouse|Manga Entertainment UK|TriStar|Destination Films|Left Bank Pictures|Triumph Films'
print(data[data['distributor_mojo'].str.contains(sony, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(sony, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
data['universal'] = data['distributor_mojo'].str.contains(nbcuniversal, na=False, regex=True)
data['paramount'] = data['distributor_mojo'].str.contains(viacom, na=False, regex=True)
data['warner'] = data['distributor_mojo'].str.contains(warnermedia, na=False, regex=True)
data['disney'] = data['distributor_mojo'].str.contains(disney, na=False, regex=True)
data['sony'] = data['distributor_mojo'].str.contains(sony, na=False, regex=True)
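# Start with None (object dtype) rather than np.nan so the string labels below match the column's dtype
data['distributor'] = None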
data.loc[data['universal'], 'distributor'] = 'Universal'
data.loc[data['paramount'], 'distributor'] = 'Paramount'
data.loc[data['warner'], 'distributor'] = 'Warner'
data.loc[data['disney'], 'distributor'] = 'Disney'
data.loc[data['sony'], 'distributor'] = 'Sony'
# We only want to keep rows that have one of the Big Five
data = data[data['distributor'].notna()]
figure, axis = plt.subplots()
figure.suptitle('Big 5 Share Of The Domestic Movie Market')
data['distributor'].value_counts().plot(kind='pie')
axis.set_ylabel('');
We have very few movies from before the 1970s. We will remove these entries to simplify our analysis.
data['release_decade'].value_counts()
data = data[data['release_decade'] >= 1970]
data.info()
Our filtered dataset now has 2,531 entries.
The movie studios all have a fair chunk of the dataset. This will hopefully prevent bias stemming from lack of equitable market share.
We have no missing values, so we can do all monetary calculations safely.
In this section of the notebook, we define helper functions for labeling bars and picking colors in our graphs.
# Tailored from matplotlib documentation
# https://matplotlib.org/examples/api/barchart_demo.html
# Function to add counts/percentages to bar plots
def autolabel(axis, num_decimals=0, counts=None, fontsize=20):
    """
    Attach a text label above each bar displaying its height.
    If sent a list of counts, display those instead.
    """
    for i, val in enumerate(axis.patches):
        if counts is not None:
            height = counts[i]
        else:
            height = round(val.get_height(), num_decimals) if num_decimals > 0 else int(round(val.get_height(), 0))
        # We don't want to display zeros on our bar plots
        if (height == 0) or pd.isnull(height):
            continue
        # Put the count below a negative value bar (anchor the top of the text just past the bar's end)
        if height < 0:
            axis.text(val.get_x() + val.get_width()/2, val.get_height()*1.05, '{}'.format(height), ha='center', va='top', fontsize=fontsize)
        else:
            axis.text(val.get_x() + val.get_width()/2, val.get_height()*1.05, '{}'.format(height), ha='center', va='bottom', fontsize=fontsize)
# Create custom function to generate the color list when graphing
def generate_color_list(colors_needed=1, order_list=['action', 'adventure', 'comedy', 'drama', 'horror', 'thriller_suspense']):
    colors_available = ['color1', 'color2', 'color3']
    c_list = []
    # Matplotlib needs a list of colors if the graph doesn't have multiple columns per index
    if colors_needed == 1:
        c_list = [genres_dict[genre][colors_available[0]] for genre in order_list]
        return c_list
    # Matplotlib needs a list of tuples if the graph has multiple columns per index
    for i in range(colors_needed):
        temp_tuple = tuple([genres_dict[genre][colors_available[i]] for genre in order_list])
        c_list.append(temp_tuple)
    return c_list
In this section of the notebook, we set up the genre lists, colors, and summary statistics we will use for graphing.
# Create lists of useful information for graphing
genres = ['action', 'adventure', 'comedy', 'drama', 'horror', 'thriller_suspense']
colors = ['#008FD5', '#FC4F30', '#E5AE38', '#6D904F', '#8B8B8B', '#810F7C']
colors2 = ['#87C7E5', '#F4BAB0', '#F4DBA8', '#C7E2AE', '#D6D1D1', '#CE8EDB']
colors3 = ['#C5E7F7', '#F4D7D2', '#F9ECD1', '#E3F2D5', '#EAE8E8', '#ECC8F4']
# Create a dictionary holding the colors for each genre
genres_dict = {
'action': {'color1': '#008FD5', 'color2': '#87C7E5', 'color3': '#C5E7F7', 'colormap': 'Blues'},
'adventure': {'color1': '#FC4F30', 'color2': '#F4BAB0', 'color3': '#F4D7D2', 'colormap': 'Oranges'},
'comedy': {'color1': '#E5AE38', 'color2': '#F4DBA8', 'color3': '#F9ECD1', 'colormap': 'Reds'},
'drama': {'color1': '#6D904F', 'color2': '#C7E2AE', 'color3': '#E3F2D5', 'colormap': 'Greens'},
'horror': {'color1': '#8B8B8B', 'color2': '#D6D1D1', 'color3': '#EAE8E8', 'colormap': 'Greys'},
'thriller_suspense': {'color1': '#810F7C', 'color2': '#CE8EDB', 'color3': '#ECC8F4', 'colormap': 'Purples'}
}
# Create a summary statistics dataframe separated by genre to make graphing easier
# The columns are:
# Number of movies
# Average gross
# All-time gross
# Average budget
# All-time budget
# Dollar earned for dollar spent (including marketing -- adjusted budget is 1.5 times original budget)
# Median dollars earned for dollars spent
# Mean dollars earned for dollars spent
# Median profit
# Mean profit
# All-time profit
# Breakeven percentage
# Current decade (2010s) median profit
# Current decade (2010s) mean profit
# Current decade (2010s) all profit
# Current decade (2010s) breakeven percentage
aggregation_stats_per_genre = {
'num_movies': [data[genre].sum() for genre in genres],
'avg_gross': [round(data[data[genre]]['domestic_adj'].mean() / 1000000, 1) for genre in genres],
'median_gross': [round(data[data[genre]]['domestic_adj'].median() / 1000000, 1) for genre in genres],
'all_time_gross': [round(data[data[genre]]['domestic_adj'].sum() / 1000000000, 1) for genre in genres],
'avg_budget': [round(data[data[genre]]['budget_adj'].mean() / 1000000, 1) for genre in genres],
'median_budget': [round(data[data[genre]]['budget_adj'].median() / 1000000, 1) for genre in genres],
'all_time_budget': [round(data[data[genre]]['budget_adj'].sum() / 1000000000, 1) for genre in genres],
'dollars_earned_for_dollars_spent': [round((data[data[genre]]['domestic_adj'].sum() / 2000000) / (1.5 * data[data[genre]]['budget_adj'].sum() / 1000000), 1) for genre in genres],
'median_dollars_earned_for_dollars_spent': [round((data[data[genre]]['domestic_adj'].median() / 2000000) / (1.5 * data[data[genre]]['budget_adj'].median() / 1000000), 1) for genre in genres],
'mean_dollars_earned_for_dollars_spent': [round((data[data[genre]]['domestic_adj'].mean() / 2000000) / (1.5 * data[data[genre]]['budget_adj'].mean() / 1000000), 1) for genre in genres],
'median_profit': [round((data[data[genre]]['profit'].median() / 1000000), 1) for genre in genres],
'mean_profit': [round((data[data[genre]]['profit'].mean() / 1000000), 1) for genre in genres],
'all_time_profit': [round(data[data[genre]]['profit'].sum() / 1000000000, 1) for genre in genres],
'breakeven_percentage': [round(data[data[genre]]['domestic_breakeven'].sum() / data[data[genre]]['domestic_breakeven'].count() * 100, 1) for genre in genres],
'current_decade_median_profit': [round((data[(data[genre]) & (data['release_year'] >=2010)]['profit'].median() / 1000000), 1) for genre in genres],
'current_decade_mean_profit': [round((data[(data[genre]) & (data['release_year'] >=2010)]['profit'].mean() / 1000000), 1) for genre in genres],
'current_decade_profit': [round(data[(data[genre]) & (data['release_year'] >=2010)]['profit'].sum() / 1000000000, 1) for genre in genres],
'current_decade_breakeven_percentage': [round(data[(data[genre]) & (data['release_year'] >=2010)]['domestic_breakeven'].mean() * 100, 1) for genre in genres]
}
summary = pd.DataFrame(aggregation_stats_per_genre, index=genres)
summary
In this section of the notebook, we want to get a broad overview of the entire dataset.
# Create custom function to make bar graphs with our summary dataframe
def plot_summary_dataframe(summary, sort_column, plot_columns, title, colors_needed=1, legend_needed=False, legend_text=[], y_label='Millions', num_decimals=0):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
    figure.suptitle(title, fontsize=20, y=1.02)
    # Sort a copy so the caller's summary dataframe is not reordered as a side effect
    summary = summary.sort_values(sort_column, ascending=False)
    color_list = generate_color_list(colors_needed=colors_needed, order_list=summary.index)
    summary.plot(y=plot_columns, kind='bar', ax=axis, color=color_list, legend=legend_needed)
    axis.set_ylabel(y_label, fontsize=20)
    axis.set_xlabel('')
    axis.tick_params(labelsize=20)
    if legend_needed:
        axis.legend(legend_text, fontsize=20)
    autolabel(axis, num_decimals=num_decimals)
    plt.tight_layout()
plot_summary_dataframe(summary=summary, sort_column='num_movies', plot_columns='num_movies',
title='Number of Movies Per Genre', colors_needed=1, legend_needed=False, legend_text=[], y_label='', num_decimals=0)
Let's start with some exploratory data analysis looking at the big picture.
Here, we look at overall trends of how much money the movies in our dataset have earned at the domestic box office.
# Create custom function to plot different aggregate statistics as histograms
def plot_aggregate_histogram(data, stat, title, bins=10, color=genres_dict['action']['color2']):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24,9))
    figure.suptitle(title, fontsize=20)
    (data[stat] / 1000000).plot.hist(bins=bins, ax=axis, fontsize=20, color=color)
    axis.set_xlabel('Millions of Dollars', fontsize=20)
    axis.set_ylabel('Number of Movies', fontsize=20)
    axis.axvline(data[stat].median() / 1000000, color='k', linewidth=1)
    axis.axvline(data[stat].mean() / 1000000, color='r', linewidth=1)
    axis.legend(['Median: {:.1f} million'.format(data[stat].median() / 1000000), 'Mean: {:.1f} million'.format(data[stat].mean() / 1000000)], fontsize=20)
plot_aggregate_histogram(data=data, stat='domestic_adj', title='Domestic Grosses',
bins=range(0, 1500, 25), color=genres_dict['action']['color2'])
Here, we look at overall trends for production budgets for the movies in our dataset.
plot_aggregate_histogram(data=data, stat='budget_adj', title='Domestic Budgets',
bins=range(0, 400, 10), color=genres_dict['action']['color2'])
Here, we look at overall trends for how much profit (as defined by our profitability equation) the movies in our dataset have earned.
Note that these profits are calculated only based on domestic box office figures. Many of these movies would have been released abroad as well, so this isn't painting the full financial picture. For that, see the Worldwide.ipynb notebook.
plot_aggregate_histogram(data=data, stat='profit', title='Domestic Profits', bins=range(-400, 675, 25), color=genres_dict['action']['color2'])
Number of movies
Skewed grosses
Skewed budgets
Skewed profits
Use median
In this section of the notebook, we get an overall sense of how the genres compare to each other historically.
This graph shows the results of adding up the domestic box office grosses for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='all_time_gross', plot_columns='all_time_gross',
title='Total Domestic Gross Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Billions', num_decimals=0)
This graph shows the results of adding up the production budgets for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='all_time_budget', plot_columns='all_time_budget',
title='Total Budgets Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Billions', num_decimals=0)
This graph shows the results of adding up the domestic profits for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='all_time_profit', plot_columns='all_time_profit',
title='Total Profit Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Billions', num_decimals=0)
This graph shows the results of taking the median value of domestic profits for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='median_profit', plot_columns='median_profit',
title='Median Profit Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Millions', num_decimals=0)
This graph shows the results of taking the mean value of domestic profits for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='mean_profit', plot_columns='mean_profit',
title='Mean Profit Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Millions', num_decimals=0)
Highest gross
Profitability
Median profit
Mean profit
Thoughts
In this section of the notebook, we take a closer look at the domestic box office grosses by genre.
plot_summary_dataframe(summary=summary, sort_column='avg_gross', plot_columns=['avg_gross', 'median_gross'],
title='Mean and Median Domestic Gross', colors_needed=2, legend_needed=True,
legend_text=['Mean', 'Median'], y_label='Millions', num_decimals=0)
This graph shows a histogram of the domestic box office grosses of all movies in our dataset, separated by genre.
The width of the bars is 50 million dollars. This means that each bar represents the number of movies that have grossed an amount of money somewhere in that 50 million dollar range.
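To see what a 50 million dollar bin width means in practice, here is a minimal sketch using `numpy.histogram` on a handful of made-up grosses (in millions of dollars). The values are hypothetical and only illustrate how movies fall into bins.

```python
import numpy as np

# Hypothetical domestic grosses in millions of dollars
grosses_millions = np.array([12, 49, 50, 75, 120, 149, 310])

# Bin edges every 50 million: 0, 50, 100, ..., 350
bin_edges = np.arange(0, 351, 50)

counts, edges = np.histogram(grosses_millions, bins=bin_edges)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('{:>3} to {:>3} million: {} movies'.format(left, right, count))
```

Each count is the height of one bar. Every bin includes its left edge and excludes its right edge, except the last bin, which includes both.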
Note that we are only showing those movies with domestic grosses up to 900 million dollars here. This is to make the graphs easier to read by not having too much empty space spanning the larger gross values that have very few entries.
This only excludes three movies in our dataset:
# Custom function to plot histograms of a stat by genre
def plot_histograms_by_genre(data, stat, title, genres, bins=10, colors_needed=1):
    figure, axes = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=True, figsize=(24,15))
    figure.suptitle(title, fontsize=20)
    # Order the genres from highest to lowest median so the subplots are ranked
    sorted_genres = sorted([{'genre': genre, 'amount': (data[data[genre]][stat].median() / 1000000)} for genre in genres], key=lambda k: k['amount'], reverse=True)
    genres_list = [item['genre'] for item in sorted_genres]
    color_list = generate_color_list(colors_needed=1, order_list=genres_list)
    for genre, axis, color in zip(genres_list, axes.flat, color_list):
        (data[data[genre]][stat] / 1000000).plot.hist(bins=bins, ax=axis, color=color)
        axis.set_title(genre, fontsize=20)
        axis.axvline(data[data[genre]][stat].median() / 1000000, color='k', linewidth=1)
        axis.axvline(data[data[genre]][stat].mean() / 1000000, color='r', linewidth=1)
        axis.legend(['Median: {:.1f} million'.format(data[data[genre]][stat].median() / 1000000), 'Mean: {:.1f} million'.format(data[data[genre]][stat].mean() / 1000000)], fontsize=15)
        axis.set_xlabel('Millions', fontsize=20)
        axis.set_ylabel('Number of Movies')
# range() excludes its stop value, so a stop of 950 makes the last bin edge 900
plot_histograms_by_genre(data=data, stat='domestic_adj', title='Domestic Gross Distributions',
                         genres=genres, bins=range(0, 950, 50), colors_needed=1)
Skew
Mean and median
Action/Adventure
In this section, we examine the Action/Adventure subgenre to see how it affects the Gross, Budget, and Profit of the Action and Adventure genres.
print('Median gross of Action/Adventure movies: ${:.1f} million'.format(data[data['genres_mojo'] == 'Action / Adventure']['domestic_adj'].median() / 1000000))
print('Median gross of Action (without Adventure component) movies: ${:.1f} million'.format(data[(data['genres_mojo'].str.contains('Action', na=False)) & (~data['genres_mojo'].isin(['Action / Adventure']))]['domestic_adj'].median() / 1000000))
print('Median gross of Adventure (without Action component) movies: ${:.1f} million'.format(data[(data['genres_mojo'].str.contains('Adventure', na=False)) & (~data['genres_mojo'].isin(['Action / Adventure']))]['domestic_adj'].median() / 1000000))
# Custom function to show the Action/Adventure subgenres effect on Action and Adventure stats
def action_adventure_stats(genre, stat, title):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
    figure.suptitle(title, fontsize=20)
    median_with_action_adventure = data[data[genre]][stat].median() / 1000000
    median_without_action_adventure = data[(data[genre]) & (data['genres_mojo'] != 'Action / Adventure')][stat].median() / 1000000
    grp = data[data[genre]].groupby('genres_mojo')[stat].median().sort_values(ascending=False) / 1000000
    grp.plot(kind='bar', ax=axis, color=genres_dict[genre]['color1'])
    axis.axhline(median_with_action_adventure, color='k', linewidth=1)
    axis.axhline(median_without_action_adventure, color='r', linewidth=1)
    axis.tick_params(labelsize=20)
    axis.set_xlabel('')
    axis.set_ylabel('Millions', fontsize=20)
    axis.legend(['Overall Median With Action/Adventure: {:.1f}'.format(median_with_action_adventure),
                 'Overall Median Without Action/Adventure: {:.1f}'.format(median_without_action_adventure)], loc='best', fontsize=20)
    autolabel(axis)
action_adventure_stats(genre='action', stat='domestic_adj', title='Median Gross By Action Subgenres')
action_adventure_stats(genre='action', stat='budget_adj', title='Median Budget By Action Subgenres')
action_adventure_stats(genre='action', stat='profit', title='Median Profit By Action Subgenres')
action_adventure_stats(genre='adventure', stat='domestic_adj', title='Median Gross By Adventure Subgenres')
action_adventure_stats(genre='adventure', stat='budget_adj', title='Median Budget By Adventure Subgenres')
action_adventure_stats(genre='adventure', stat='profit', title='Median Profit By Adventure Subgenres')
Action/Adventure is the culprit!
Gross
Budget
Profit
Keep it in the back of our minds
In this section of the notebook, we take a closer look at production budgets of each genre since the 1970s.
plot_summary_dataframe(summary=summary, sort_column='avg_budget', plot_columns=['avg_budget', 'median_budget'],
title='Mean and Median Production Budget', colors_needed=2, legend_needed=True,
legend_text=['Mean', 'Median'], num_decimals=0)
plot_histograms_by_genre(data=data, stat='budget_adj', title='Production Budget Distributions',
genres=genres, bins=10, colors_needed=1)
plot_aggregate_histogram(data=data, stat='budget_adj', title='Production Budgets Of All Movies',
bins=10, color=genres_dict['action']['color2'])
Median production budget
Low budgets
In this section of the notebook, we take a closer look at domestic profits for each genre since the 1970s.
Since we define a movie's genre by all the genres it contains, many of our movies have multiple genres that we care about.
It would make life easier if, in one column, we could store a statistic and the genre of the movie.
Obviously, this involves duplication in situations where a movie has multiple genres (for example, an Action/Adventure movie counts as both an Action and Adventure movie).
These custom functions create new columns that contain the corresponding statistic (domestic gross, budget, profit, breakeven) for movies of a certain genre, and a missing value otherwise.
This makes graphing certain things much easier.
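For comparison, the same kind of per-genre column can be sketched without looping over rows, using `Series.where`. This is a vectorized alternative shown on a toy dataframe with invented numbers, not the approach this notebook takes.

```python
import pandas as pd

# Toy data: boolean genre flags plus a profit column (hypothetical values, in millions)
toy = pd.DataFrame({
    'action': [True, True, False],
    'adventure': [True, False, True],
    'profit': [50.0, 0.0, -20.0],
})

# Keep the profit where the movie is in the genre; everything else becomes NaN
for genre in ['action', 'adventure']:
    toy['domestic_profit_{}'.format(genre)] = toy['profit'].where(toy[genre])

print(toy)
# NaN entries are skipped by median(), so each column summarizes only its own genre
print(toy['domestic_profit_action'].median())
print(toy['domestic_profit_adventure'].median())
```

Note that the second movie's profit of exactly 0 stays 0 in the action column. Only movies outside the genre become missing values.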
# We want individual columns that hold a specific domestic stat for each genre.
# Since a movie can have multiple genres, right now we must isolate each genre with a groupby while looping over each genre.
# If we create individual columns that contain information about a genre and a domestic stat, it's easier to graph later.
def domestic_stat_by_genre(row, genre, stat):
    # Return the stat only for movies in the genre; otherwise NaN so medians and means skip the row.
    # Checking the genre flag directly (instead of testing whether genre * stat equals 0)
    # keeps legitimate zero values, such as a profit of exactly 0.
    if row[genre]:
        return row[stat]
    return np.nan
# We want individual columns that store breakeven information for each genre.
# Since we will be adding the entries in these columns (and using pd.DataFrame.mean()), we need to convert them to 1's and 0's.
# Thus, we need to create a separate function from the 'domestic_stat_by_genre' function.
def test_for_breakeven_by_genre(row, genre, breakeven_column):
    if row[genre]:
        if row[breakeven_column]:
            return 1
        else:
            return 0
    else:
        return np.nan
# List of new columns to hold domestic stats by genre.
budget_columns = ['domestic_budget_{}'.format(genre) for genre in genres]
gross_columns = ['domestic_gross_{}'.format(genre) for genre in genres]
profit_columns = ['domestic_profit_{}'.format(genre) for genre in genres]
breakeven_columns = ['domestic_breakeven_{}'.format(genre) for genre in genres]
for genre, col in zip(genres, budget_columns):
    data[col] = data.apply(lambda x: domestic_stat_by_genre(x, genre, 'budget_adj'), axis=1)
for genre, col in zip(genres, gross_columns):
    data[col] = data.apply(lambda x: domestic_stat_by_genre(x, genre, 'domestic_adj'), axis=1)
for genre, col in zip(genres, profit_columns):
    data[col] = data.apply(lambda x: domestic_stat_by_genre(x, genre, 'profit'), axis=1)
for genre, col in zip(genres, breakeven_columns):
    data[col] = data.apply(lambda x: test_for_breakeven_by_genre(x, genre, 'domestic_breakeven'), axis=1)
def plot_boxplot(data, genres, title, columns, starting_year=1970, y_label=''):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24,15))
    figure.suptitle(title, fontsize=20, y=1.05)
    data[data['release_year'] >= starting_year][columns].plot(kind='box', ax=axis)
    axis.set_ylabel(y_label, fontsize=20)
    axis.set_xticklabels(genres)
    axis.tick_params(labelsize=20)
    axis.axhline(0, color='k', linewidth=1)
    plt.tight_layout()
plot_boxplot(data=data, genres=genres, title='Profit By Genre', columns=profit_columns, starting_year=1970, y_label='Hundreds Of Millions')
Long right tails and negative medians!
Distribution shapes
# Custom function to plot profit data of our subgenres. The genres are sorted from highest to lowest profit stat.
def profit_by_subgenres(data, genres, aggregation_function='median', apply_function=lambda x: x / 1000000):
    sorted_genres = sorted([{'genre': genre, 'amount': (data[data[genre]]['profit'].agg(aggregation_function))} for genre in genres], key=lambda k: k['amount'], reverse=True)
    genres_list = [item['genre'] for item in sorted_genres]
    color_list = generate_color_list(colors_needed=1, order_list=genres_list)
    figure, axes = plt.subplots(nrows=6, ncols=1, figsize=(24, 54))
    for genre, color, axis in zip(genres_list, color_list, axes.flat):
        overall_stat = data[data[genre]]['profit'].agg(aggregation_function) / 1000000
        (data[data[genre]].groupby('genres_mojo')['profit'].agg(aggregation_function).apply(apply_function).sort_values(ascending=False)).plot(kind='bar', ax=axis, color=color)
        axis.axhline(overall_stat, color='k', linewidth=1)
        axis.tick_params(labelsize=20)
        axis.set_xlabel('')
        axis.set_ylabel('Millions', fontsize=20)
        axis.set_title('{} Profit By {} Subgenres'.format(aggregation_function.title(), genre.title()), fontsize=20, y=1.02)
        axis.legend(['Overall {}: {:.1f}'.format(aggregation_function.title(), overall_stat)], loc=3, fontsize=20)
        autolabel(axis)
    plt.tight_layout()
profit_by_subgenres(data=data, genres=genres, aggregation_function='median', apply_function=lambda x: x / 1000000)
The domestic market isn't enough
Best subgenres?
profit_by_subgenres(data=data, genres=genres, aggregation_function='mean', apply_function=lambda x: x / 1000000)
Results are similar to median profits by subgenre
Here's another way to measure how successful a genre is: look at the ratio of earnings to expenses.
In other words, for each movie we capture (Worldwide Box Office / 2) / (1.5 * Production Budget).
Then for each genre, we can either add up all the results (i.e. see how the genre fares for every datapoint we have), take the median, or take the mean.
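As a rough illustration of the ratio described above, here is a minimal sketch on made-up numbers. The column names (`box_office`, `budget`, `earned`, `spent`, `ratio`) and the `movies` and `ratio_summary` frames are hypothetical, not the notebook's own. The "aggregate" figure is interpreted here as total earned divided by total spent for the genre, which is an assumption about what "add up all the results" means.

```python
import pandas as pd

# Hypothetical movies, with box office and production budget in millions
movies = pd.DataFrame({
    'genre': ['horror', 'horror', 'action', 'action'],
    'box_office': [60.0, 30.0, 300.0, 90.0],
    'budget': [10.0, 20.0, 100.0, 120.0],
})

# Studio's share of the box office and the budget padded by 50%, per the formula in the text
movies['earned'] = movies['box_office'] / 2
movies['spent'] = 1.5 * movies['budget']
movies['ratio'] = movies['earned'] / movies['spent']

# Three ways to summarize a genre: aggregate (assumed total/total), mean, and median
totals = movies.groupby('genre')[['earned', 'spent']].sum()
ratio_summary = pd.DataFrame({
    'aggregate': totals['earned'] / totals['spent'],
    'mean': movies.groupby('genre')['ratio'].mean(),
    'median': movies.groupby('genre')['ratio'].median(),
})
print(ratio_summary)
```

A value above 1.0 means the genre earned back more than a dollar for every dollar spent under these assumptions.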
plot_summary_dataframe(summary=summary, sort_column='dollars_earned_for_dollars_spent',
plot_columns=['dollars_earned_for_dollars_spent', 'mean_dollars_earned_for_dollars_spent', 'median_dollars_earned_for_dollars_spent'],
title='Dollars Earned Per Dollar Spent', colors_needed=3, legend_needed=True,
legend_text=['All-Time', 'Mean', 'Median'], y_label='Dollars', num_decimals=1)
We need the international market
We can calculate the percentage chance a movie has to break even as another way to judge relative risk.
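A small sketch of how such a percentage can be computed: taking the mean of a True/False column gives the share of True values, which is the same pattern the notebook uses elsewhere with the `domestic_breakeven` column. The `movies` frame and the `broke_even` column below are made up for illustration.

```python
import pandas as pd

# Hypothetical movies flagged by whether they broke even
movies = pd.DataFrame({
    'genre': ['horror'] * 4 + ['action'] * 4,
    'broke_even': [True, False, True, False, False, False, False, True],
})

# The mean of a boolean column is the fraction of True values; scale it to a percentage
breakeven_pct = movies.groupby('genre')['broke_even'].mean() * 100
print(breakeven_pct)
```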
plot_summary_dataframe(summary=summary, sort_column='breakeven_percentage', plot_columns='breakeven_percentage',
title='Breakeven Percentage', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Percent', num_decimals=1)
Less than 30%
This would be a difficult decision to make.
From the 1970s to today, Comedy and Drama are the two most frequently produced genres with 995 and 680 movies, respectively.
Comedy, Action, and Adventure have made the most overall money domestically. Adventure and Action have far higher median domestic grosses than Comedy (\$106 million and \$87 million versus \$48 million). Comedy must be making up for this with its higher number of movies released. However, Adventure and Action are the two most expensive genres to make (median budgets of \$117 million and \$87 million), whereas Comedy is fourth at \$36 million.
The highest aggregate return for every dollar spent comes from Horror, then Comedy.
The highest median return for every dollar spent also comes from Horror, then Comedy.
The highest median profit per genre is Horror, then Drama, then Comedy. (Note though that all three of these numbers are negative, since no genre has a positive median profit all-time.)
The genres with the lowest median budgets are Drama (\$30 million), then Horror (\$31 million), then Comedy (\$36 million).
Genres with the best chance to break even are Horror, then Comedy, then Drama.
Since all genres are median losers domestically, if our distribution strategy only includes the domestic market, we shouldn't hope to come out ahead in the long run. However, if pushed to select the best genres, I would suggest Horror, then either Comedy or Drama.
Horror is one of the cheapest genres to produce and yet it has the highest median profit per genre. It also has the highest chance to break even. Our bosses could make around three Horror movies for the price of a single Action movie, or four Horror movies for the price of one Adventure movie. Horror's median gross of \$53 million is well below Action's \$87 million and Adventure's \$106 million, but it would be a safer play.
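The "three or four Horror movies" comparison follows directly from the median budgets quoted in this summary. A quick check of the arithmetic, with the figures (in millions) copied from the text and hypothetical variable names:

```python
# Median budgets in millions of 2018 dollars, as quoted in the summary
median_budget = {'horror': 31, 'action': 87, 'adventure': 117}

# How many median-budget Horror movies fit into one Action or Adventure budget
horror_per_action = median_budget['action'] / median_budget['horror']
horror_per_adventure = median_budget['adventure'] / median_budget['horror']

print(round(horror_per_action, 1), round(horror_per_adventure, 1))
```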
It's tough to decide if the next best option is Comedy or Drama.
Comedy has the third cheapest median budget and the third highest median profit. Historically, it is a solid genre, as it has earned the most box office dollars in aggregate. It also earns the second-most money per dollar spent.
Drama has the cheapest median budget, the second-highest median profit, and the third-highest chance to break even. So while these movies tend to break even less often than Comedies, they tend to be less expensive to produce and have a higher median profit.
If our bosses really care about releasing those mega blockbusters, risk be damned, then we all know they're talking about Action and Adventure, the kings of the right tails. They have by far the highest median grosses, but are also the biggest median losers (-\$79.7 million and -\$96.5 million, respectively).
They also have the lowest chance to break even, at 9.5% for Action and 8.5% for Adventure.
These two genres require the additional revenue of the international box office the most for survival.
For the domestic market specifically, I'd stay away from Action and Adventure, and focus on Horror and either Comedy or Drama.
There are a lot of reasons why studios prefer to make certain genres over others. After all, it's tough to make theme park rides based on Drama and Comedy movies. This analysis assumes we only care about how much money a movie makes at the box office.
So far, we have only analyzed these genres in aggregate.
Our bosses want more pinpoint accuracy. Which genres are the hottest right now? Which genres perform the best during which parts of the year?
So we've got more digging to do, and we'll next look for trends by Release Decade and Release Week.
We will now dive into the performance of movies by decade of release.
Up until now, we haven't been looking at our data from a time perspective. We have only been looking at movies by genre.
It's time to look at our data by genre and by decade.
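For reference, a decade label can be derived from a release year with integer division. This is only a sketch: the notebook's release_decade column comes from earlier processing that is not shown here, and the helper name below is hypothetical. It assumes each decade is labeled by its first year (1970, 1980, and so on), which matches how the decade values are used in the plotting code.

```python
def decade_of(year):
    """Return the first year of the decade that contains `year`."""
    return (year // 10) * 10

# A few sample release years and their decade labels
for year in (1979, 1980, 2018):
    print(year, decade_of(year))
```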
To make graphing easier, we create a custom function to help us do this.
def plot_by_time_and_stat(data, genres, title, groupby_column, stat_columns, aggregate_function, apply_needed=False, apply_function=None, y_label='', y_ticks_needed=False, y_ticks='', legend_needed=True, legend_text=genres, color=colors, axhline_needed=False, axhline_value='', autolabel_needed=False, autolabel_fontsize=20):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24,9))
    figure.suptitle(title, fontsize=20)
    if apply_needed:
        data.groupby(groupby_column)[stat_columns].agg(aggregate_function).apply(apply_function).plot(kind='bar', ax=axis, color=color)
    else:
        data.groupby(groupby_column)[stat_columns].agg(aggregate_function).plot(kind='bar', ax=axis, color=color)
    axis.set_xlabel('')
    axis.set_ylabel(y_label, fontsize=20)
    axis.tick_params(labelsize=20)
    if legend_needed:
        axis.legend(legend_text, loc='best', fontsize=15)
    if y_ticks_needed:
        axis.set_yticks(y_ticks)
    if axhline_needed:
        axis.axhline(axhline_value, color='k', linewidth=1)
    if autolabel_needed:
        autolabel(axis, fontsize=autolabel_fontsize)
plot_by_time_and_stat(data=data, genres=genres, title='Genres Released By Decade',
groupby_column='release_decade', stat_columns=genres, aggregate_function='sum',
apply_needed=False, apply_function=None, y_label='Number of Movies', y_ticks_needed=False, y_ticks='')
1970s to 2000s
2000s to 2010s
Fewer movies made now
We will look at a couple graphs to get a sense of our movies without separating them by genre.
plot_by_time_and_stat(data=data, genres=genres, title='Total Domestic Box Office By Decade',
groupby_column='release_decade', stat_columns='domestic_adj',
aggregate_function='sum', apply_needed=True, apply_function=lambda x: x / 1000000000,
y_label='Billions', y_ticks_needed=False, y_ticks='', legend_needed=False,
legend_text='', color=genres_dict['action']['color2'])
plot_by_time_and_stat(data=data, genres=genres, title='Total Domestic Box Office By Year',
groupby_column='release_year', stat_columns='domestic_adj',
aggregate_function='sum', apply_needed=True, apply_function=lambda x: x / 1000000000,
y_label='Billions', y_ticks_needed=False, y_ticks='', legend_needed=False,
legend_text='', color=genres_dict['action']['color2'])
plot_by_time_and_stat(data=data, genres=genres, title='Total Domestic Gross By Genre And Decade', groupby_column='release_decade', stat_columns=gross_columns, aggregate_function='sum', apply_needed=True, apply_function=lambda x: x / 1000000000, y_label='Billions', y_ticks_needed=False, y_ticks='')
# Create custom function to determine the background color for labeling the genre with the highest stat per decade
def find_genre_for_background_color(groupby_instance, decade):
    column_name_list = groupby_instance.loc[decade].sort_values(ascending=False).index[0].split('_')
    # Check if the split string has length 4, if so it is thriller_suspense and requires extra filtering
    # The reason is our genres are 'action', 'adventure', 'comedy', 'drama', 'horror', and 'thriller_suspense'
    # Our worldwide stat column names have the following form: worldwide_(stat name)_genre
    # Thus five of our six genres will have length 3 when split on '_', but 'thriller_suspense' will have length 4
    if len(column_name_list) == 4:
        return '_'.join(column_name_list[-2:])
    # If the genre is not 'thriller_suspense', we just need the last word in the list
    return column_name_list[-1]
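The length check above relies on the stat name in each column being a single word. As a sketch of an alternative (not the notebook's method), the genre can be recovered by matching the end of the column name against the known genre list, which keeps working if a stat name itself contains an underscore. The function name and the sample column names are hypothetical.

```python
GENRES = ['action', 'adventure', 'comedy', 'drama', 'horror', 'thriller_suspense']

def genre_from_column(column_name, genres=GENRES):
    """Return the genre whose name ends the column name, or None if none matches."""
    for genre in genres:
        if column_name.endswith('_' + genre):
            return genre
    return None

# Hypothetical column names following a prefix_(stat name)_genre pattern
print(genre_from_column('domestic_gross_action'))
print(genre_from_column('domestic_gross_thriller_suspense'))
```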
# Create a stacked bar plot of a stat by genre for each year, highlighting the genre with the highest value in each decade
def plot_stat_by_year_and_highlight_decade_winner(data, genres, title, stat_columns, aggregation_function, apply_function=None, y_label=''):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
    figure.suptitle(title, fontsize=20, y=1.05)
    # Determine background colors for each decade
    grp = data.groupby('release_decade')[stat_columns].agg(aggregation_function)
    bg_1970 = genres_dict[find_genre_for_background_color(grp, 1970)]['color1']
    bg_1980 = genres_dict[find_genre_for_background_color(grp, 1980)]['color1']
    bg_1990 = genres_dict[find_genre_for_background_color(grp, 1990)]['color1']
    bg_2000 = genres_dict[find_genre_for_background_color(grp, 2000)]['color1']
    bg_2010 = genres_dict[find_genre_for_background_color(grp, 2010)]['color1']
    # Set up plot
    grp = data.groupby('release_year')[stat_columns].agg(aggregation_function).apply(apply_function)
    grp.plot(kind='bar', stacked=True, ax=axis)
    axis.set_ylabel(y_label, fontsize=20)
    axis.set_xlabel('')
    axis.tick_params(labelsize=20)
    axis.legend(genres, fontsize=20)
    axis.axvspan(0, 10, color=bg_1970, alpha=0.2)
    axis.axvspan(10, 20, color=bg_1980, alpha=0.2)
    axis.axvspan(20, 30, color=bg_1990, alpha=0.2)
    axis.axvspan(30, 40, color=bg_2000, alpha=0.2)
    axis.axvspan(40, 50, color=bg_2010, alpha=0.2)
    axis.axvline(10, color='k', alpha=0.2)
    axis.axvline(20, color='k', alpha=0.2)
    axis.axvline(30, color='k', alpha=0.2)
    axis.axvline(40, color='k', alpha=0.2)
    plt.tight_layout()
plot_stat_by_year_and_highlight_decade_winner(data=data, genres=genres,
title='Total Domestic Gross By Genre and Year\n(Background Color Is Highest Earning Genre Per Decade)',
stat_columns=gross_columns, aggregation_function='sum',
apply_function=lambda x: x / 1000000000, y_label='Billions')
Domestic box office decrease in 2010s
Highest-grossing genres
Last three decades
From the 2000s to 2010s
def plot_mean_and_median_by_time_and_stat(data, genres, groupby_column, stat_columns, stat_name_for_title, apply_needed=False, apply_function=None, y_label='', y_ticks_needed=False, y_ticks='', axhline_needed=False, axhline_value=''):
    figure, (axis1, axis2) = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
    if apply_needed:
        data.groupby(groupby_column)[stat_columns].agg('mean').apply(apply_function).plot(kind='bar', ax=axis1)
        data.groupby(groupby_column)[stat_columns].agg('median').apply(apply_function).plot(kind='bar', ax=axis2)
    else:
        data.groupby(groupby_column)[stat_columns].agg('mean').plot(kind='bar', ax=axis1)
        data.groupby(groupby_column)[stat_columns].agg('median').plot(kind='bar', ax=axis2)
    axis1.set_ylabel(y_label, fontsize=20)
    if y_ticks_needed:
        axis1.set_yticks(y_ticks)
        axis2.set_yticks(y_ticks)
    axis1.set_xlabel('')
    axis1.tick_params(labelsize=20)
    axis1.legend(genres, fontsize=20)
    axis1.set_title('Mean {} By Genre And Decade'.format(stat_name_for_title), fontsize=20, y=1.02)
    axis2.set_ylabel(y_label, fontsize=20)
    axis2.set_xlabel('')
    axis2.tick_params(labelsize=20)
    axis2.legend(genres, fontsize=20)
    axis2.set_title('Median {} By Genre And Decade'.format(stat_name_for_title), fontsize=20, y=1.02)
    if axhline_needed:
        axis1.axhline(axhline_value, color='k', linewidth=1)
        axis2.axhline(axhline_value, color='k', linewidth=1)
    plt.tight_layout()
plot_mean_and_median_by_time_and_stat(data=data, genres=genres, groupby_column='release_decade',
stat_columns=gross_columns, stat_name_for_title='Domestic Gross',
apply_needed=True, apply_function=lambda x: x / 1000000,
y_label='Millions', y_ticks_needed=True, y_ticks=range(0, 1100, 100))
plot_stat_by_year_and_highlight_decade_winner(data=data, genres=genres,
title='Median Domestic Gross By Genre and Year\n(Background Color Is Highest Grossing Median Genre Per Decade)',
stat_columns=gross_columns, aggregation_function='median',
apply_function=lambda x: x / 1000000, y_label='Millions')
Contracting period, then expanding period
Median gross change from 2000s to 2010s ranked from highest to lowest
Highest median gross by decade
def one_stat_over_time_in_separate_graphs(data, genres, title, figsize, colors, groupby_column, stat_column, aggregation_function, starting_year=1970, apply_needed=False, apply_function=None, xtick_values='', y_label='', axhline_needed=False, axhline_value=''):
    figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=figsize)
    figure.suptitle(title, fontsize=20, y=1.02)
    for genre, axis, color in zip(genres, axes.flat, colors):
        # Create a series indexed by the groupby column (e.g. decade or year) with the aggregated stat as values
        if apply_needed:
            (data[(data[genre]) & (data['release_year'] >= starting_year)].groupby(groupby_column)[stat_column].agg(aggregation_function).apply(apply_function)).sort_index(ascending=True).plot(kind='bar', xticks=xtick_values, ax=axis, linewidth=3, color=color)
        else:
            (data[(data[genre]) & (data['release_year'] >= starting_year)].groupby(groupby_column)[stat_column].agg(aggregation_function)).sort_index(ascending=True).plot(kind='bar', xticks=xtick_values, ax=axis, linewidth=3, color=color)
        axis.set_ylabel(y_label, fontsize=20)
        axis.tick_params(labelsize=20)
        axis.set_xlabel('')
        axis.legend([genre], loc=2, fontsize=15)
        autolabel(axis)
        if axhline_needed:
            axis.axhline(axhline_value, color='k', linewidth=1)
    plt.tight_layout()
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Median Domestic Gross By Genre and Decade', figsize=(24,24),
colors=colors, groupby_column='release_decade', stat_column='domestic_adj',
aggregation_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
xtick_values=range(1970, 2020, 10), y_label='Millions', axhline_needed=False, axhline_value='')
From the 1990s to 2000s
2000s to 2010s
Horror
plot_mean_and_median_by_time_and_stat(data=data, genres=genres, groupby_column='release_decade',
stat_columns=budget_columns, stat_name_for_title='Domestic Budget',
apply_needed=True, apply_function=lambda x: x / 1000000, y_label='Millions', y_ticks_needed=False, y_ticks='')
plot_stat_by_year_and_highlight_decade_winner(data=data, genres=genres,
title='Median Domestic Budget By Genre and Year\n(Background Color Is Highest Median Genre Per Decade)',
stat_columns=budget_columns, aggregation_function='median',
apply_function=lambda x: x / 1000000, y_label='Millions')
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Median Budget By Genre and Release Decade', figsize=(24,24),
colors=colors, groupby_column='release_decade', stat_column='budget_adj',
aggregation_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
xtick_values=range(1970, 2020, 10), y_label='Millions', axhline_needed=False, axhline_value='')
Mean and median pretty much the same
Action and Adventure
The Other Four
plot_mean_and_median_by_time_and_stat(data=data, genres=genres, groupby_column='release_decade',
stat_columns=profit_columns, stat_name_for_title='Domestic Profits',
apply_needed=True, apply_function=lambda x: x / 1000000,
y_label='Millions', y_ticks_needed=False, y_ticks='', axhline_needed=True, axhline_value=0)
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Median Domestic Profit By Genre and Release Decade', figsize=(24,24),
colors=colors, groupby_column='release_decade', stat_column='profit',
aggregation_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
xtick_values=range(1970, 2020, 10), y_label='Millions', axhline_needed=True, axhline_value=0)
The average movie is not a domestic winner
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Breakeven Percentage By Decade', figsize=(24,16),
colors=colors, groupby_column='release_decade', stat_column='domestic_breakeven',
aggregation_function='mean', apply_needed=True, apply_function=lambda x: x * 100,
xtick_values=range(1970, 2020, 10), y_label='Percentage', axhline_needed=False, axhline_value='')
plot_summary_dataframe(summary=summary,
sort_column='current_decade_breakeven_percentage',
plot_columns='current_decade_breakeven_percentage',
title='Current Decade Breakeven Percentage By Genre',
colors_needed=1,
legend_needed=False,
legend_text=[],
y_label='Percentage',
num_decimals=1)
1970s to 2000s
2000s to 2010s
Safest current genres
Movies are risky today
The movie business is so variable that looking at trends within subgenres probably doesn't yield much actionable insight.
But we shall look at mean and median profitability of subgenres by decade just in case.
# Function to plot mean and median profitability by subgenre by decade
def subgenre_profitability_by_decade(genre, colors):
    subgenres = data[data[genre.lower()]].groupby('genres_mojo').count().index
    num_subgenres = len(subgenres)
    figure, axes = plt.subplots(nrows=num_subgenres, ncols=1, figsize=(24, 50), sharex=True)
    figure.suptitle('Mean and Median Profit By {} Subgenre And Decade'.format(genre.title()), fontsize=20, y=1.02)
    for subgenre, axis in zip(subgenres, axes.flat):
        # Select the profit column before aggregating so non-numeric columns are never aggregated
        grp = data[data['genres_mojo'] == subgenre].groupby('release_decade')['profit'].agg(['mean', 'median']) / 1000000
        # If the frame is missing a decade, add it as an index and set both stats to zero
        for decade in range(1970, 2020, 10):
            if decade not in grp.index:
                grp.loc[decade] = 0
        # Sort by the index to have the decades in chronological order
        grp.sort_index(ascending=True, inplace=True)
        grp.plot(kind='bar', xticks=range(1970, 2020, 10), color=colors, linewidth=3, ax=axis)
        axis.set_ylabel('Millions', fontsize=20)
        axis.set_title(subgenre, fontsize=20)
        axis.legend(['Mean', 'Median'], loc='lower left', fontsize=15)
        axis.set_xlabel('')
        axis.tick_params(labelsize=20)
        axis.axhline(0, color='k', linewidth=1)
        autolabel(axis)
    plt.tight_layout()
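The missing-decade fill inside the function above can also be written with pandas' `reindex`, which adds the absent labels and orders the index in one step. This is a sketch of that alternative on made-up numbers, not a change to the notebook's logic; the series and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical median profit (in millions) for a subgenre with no 1980s or 2000s releases
profit_by_decade = pd.Series({1970: -4.0, 1990: -12.5, 2010: -8.0})

# Reindex against every decade, filling the gaps with zero; the result is already in order
decades = list(range(1970, 2020, 10))
filled = profit_by_decade.reindex(decades, fill_value=0)
print(filled)
```

Note that a zero bar is indistinguishable from a decade whose profit really was zero, so the fill is a plotting convenience rather than a data value.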
subgenre_profitability_by_decade('action', [genres_dict['action']['color1'], genres_dict['action']['color2']])
subgenre_profitability_by_decade('adventure', [genres_dict['adventure']['color1'], genres_dict['adventure']['color2']])
subgenre_profitability_by_decade('comedy', [genres_dict['comedy']['color1'], genres_dict['comedy']['color2']])
Fantasy Comedies were median profitable in the 1990s. (This was due to there being three Fantasy Comedies, only one of which lost money. However, that one loss was so large that it made the subgenre a net loser in terms of mean profitability.)
No other subgenre has been median profitable in the 1990s, 2000s, or 2010s.
subgenre_profitability_by_decade('drama', [genres_dict['drama']['color1'], genres_dict['drama']['color2']])
subgenre_profitability_by_decade('horror', [genres_dict['horror']['color1'], genres_dict['horror']['color2']])
subgenre_profitability_by_decade('thriller_suspense', [genres_dict['thriller_suspense']['color1'], genres_dict['thriller_suspense']['color2']])
There aren't any stellar subgenres in the domestic market.
It seems that for the most part, both the main genres and their subgenres are on average unprofitable at the domestic box office.
The movie industry has changed quite a bit since the 1970s. This has been a nice overview of the industry over the past 50 years, but now we will focus on just the current decade.
We might be able to find some useful insights about the state of the industry as it is now.
We will now dive into the performance of movies in this current decade (2010 - 2018).
We create a column, budget_bins, that categorizes each movie by budget size. The options are '0 - 1m', '1 - 5m', '5 - 10m', '10 - 25m', '25 - 50m', '50 - 100m', '100 - 200m', '200 - 300m', and '300 - 400m'. These represent where each movie's production budget falls (in millions of dollars). Then we create some custom functions to display domestic profits, breakeven percentage, and the number of movies released for all genres when subdivided by budget size.
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Percentage of Movies That Breakeven This Decade', figsize=(24,16),
colors=colors, groupby_column='release_year', stat_column='domestic_breakeven',
aggregation_function='mean', starting_year=2010, apply_needed=True, apply_function=lambda x: x * 100,
xtick_values=range(2010, 2019, 1), y_label='Percentage', axhline_needed=False, axhline_value='')
plot_summary_dataframe(summary=summary, sort_column='current_decade_breakeven_percentage',
plot_columns='current_decade_breakeven_percentage', title='Current Decade Breakeven Percentage By Genre',
colors_needed=1, legend_needed=False, legend_text=[], y_label='Percentage', num_decimals=1)
Year by year takeaways
plot_summary_dataframe(summary=summary, sort_column='current_decade_profit',
plot_columns=['current_decade_profit', 'current_decade_mean_profit', 'current_decade_median_profit'],
title='Current Decade Aggregate, Mean, and Median Profit By Genre', colors_needed=3,
legend_needed=True, legend_text=['Aggregate Profit (In Billions)', 'Mean Profit (In Millions)', 'Median Profit (In Millions)'],
y_label='Millions', num_decimals=1)
# https://matplotlib.org/3.1.0/tutorials/colors/colormap-manipulation.html
# https://stackoverflow.com/questions/1735025/how-to-normalize-a-numpy-array-to-within-a-certain-range
# https://matplotlib.org/users/gridspec.html#gridspec-and-subplotspec
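# Import the matplotlib submodules used below: colormaps, gridspec, color normalization, and colorbars
import matplotlib
import matplotlib.cm as cm
import matplotlib.colorbar
import matplotlib.colors
import matplotlib.gridspec
# Import minmax_scale to rescale our counts array to [0, 1] so each count can pick a color from the colormap
from sklearn.preprocessing import minmax_scale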
figure = plt.figure(figsize=(24,12))
figure.suptitle('Median Profit By Subgenre This Decade', fontsize=20)
gs = matplotlib.gridspec.GridSpec(50, 50)
ax1 = plt.subplot(gs[:, :-1])
ax2 = plt.subplot(gs[:, -1:])
grp = data[data['release_year'] >= 2010].groupby('genres_mojo')['profit'].agg(['median', 'count']).sort_values(by='median', ascending=False)
# Use 'viridis' colormap
viridis = cm.get_cmap('viridis')
# Normalize our counts series
scaled_counts = minmax_scale(grp['count'].astype(float), feature_range=(0,1))
# List of colors using rescaled count values
new_cmap = [viridis(item) for item in scaled_counts]
(grp['median'] / 1000000).plot(kind='bar', ax=ax1, color=new_cmap)
norm = matplotlib.colors.Normalize(vmin=grp['count'].min(), vmax=grp['count'].max())
cb1 = matplotlib.colorbar.ColorbarBase(ax2, cmap=viridis, norm=norm, orientation='vertical')
ax2.set_ylabel('Number of Movies', fontsize=20)
ax1.set_xlabel('')
ax1.set_ylabel('Millions', fontsize=20)
ax1.tick_params(labelsize=20)
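The coloring above depends on rescaling the movie counts to the range [0, 1] before looking them up in the colormap. Here is a minimal hand-written sketch of that min-max rescaling using NumPy only; it is not scikit-learn's implementation, and the function name and sample counts are hypothetical.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values identical: return zeros rather than dividing by zero
        return np.zeros_like(values)
    return (values - values.min()) / span

# Hypothetical number of movies in each subgenre
counts = [2, 5, 11, 20]
scaled = min_max_scale(counts)
print(scaled)
```

Each scaled value can then be passed to a colormap, so subgenres with more movies land nearer the top of the color scale.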
Aggregate Profit
Mean Profit
Median Profit
Subgenres
It might help to further subdivide our genres by their budgets to look for patterns there.
bins = [0, 1000000, 5000000, 10000000, 25000000, 50000000, 100000000, 200000000, 300000000, 400000000]
group_names = ['0 - 1m', '1 - 5m', '5 - 10m', '10 - 25m', '25 - 50m', '50 - 100m', '100 - 200m', '200 - 300m', '300 - 400m']
subgenre_colors = ['#8d6a9f', '#006494', '#fcfc62', '#2d4739', '#bb342f', '#6eeb83', '#e56399', '#ffe8d4', '#57886c', '#ff7700', '#16f4d0', '#bfae48', '#90c290', '#330f0a']
data['budget_bins'] = pd.cut(data['budget_adj'], bins, labels=group_names)
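To see how `pd.cut` assigns labels, here is a small sketch with a shortened bin list and made-up budgets. With the default settings each bin includes its right edge but not its left, so a budget of exactly 1,000,000 lands in '0 - 1m', and a budget of exactly 0 falls outside every bin and becomes NaN.

```python
import pandas as pd

# Shortened, illustrative version of the bins and labels
bins = [0, 1_000_000, 5_000_000, 10_000_000, 25_000_000]
labels = ['0 - 1m', '1 - 5m', '5 - 10m', '10 - 25m']

# Hypothetical budgets, including edge cases at 1,000,000 and 0
budgets = pd.Series([500_000, 1_000_000, 7_500_000, 25_000_000, 0])
binned = pd.cut(budgets, bins, labels=labels)
print(binned)
```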
# https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
# Custom function to add blue if a majority of the films break even
def background_color_blue_if_greater_than_fifty_percent(val):
    if val > 0.5:
        return 'background-color: {}'.format('#87C7E5')
    return ''
# https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
# Custom function to highlight the max value in a series
def highlight_max(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''), index=data.index, columns=data.columns)
# Create custom function to display profit and count information by genre and budget size for the current decade (2010s)
def current_decade_budget_sizes(data, genre):
    styler_object = (
        data[
            (data['release_year'] >= 2010) &
            (data['genres_mojo'].str.contains(genre))
        ][['budget_bins', 'profit', 'domestic_breakeven']]
        .apply(lambda x: x / 1000000 if x.name == 'profit' else x)
        .sort_values(by=['budget_bins', 'profit'], ascending=False)
        .groupby('budget_bins')
        .agg(['mean', 'median', 'count', 'sum'])
        .drop([('profit', 'count'), ('profit', 'mean'), ('profit', 'sum'), ('domestic_breakeven', 'median')], axis=1)
        .dropna()
        .style
        .applymap(background_color_blue_if_greater_than_fifty_percent, subset=[('domestic_breakeven', 'mean')])
        .apply(highlight_max, subset=[('domestic_breakeven', 'count')])
        .background_gradient('winter', subset=[('profit', 'median')])
    )
    return styler_object
action_budget_info = current_decade_budget_sizes(data=data, genre='Action')
action_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
adventure_budget_info = current_decade_budget_sizes(data=data, genre='Adventure')
adventure_budget_info
Most produced budget
Most profitable budget (descending order)
Least profitable budget
Conclusions
comedy_budget_info = current_decade_budget_sizes(data=data, genre='Comedy')
comedy_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
drama_budget_info = current_decade_budget_sizes(data=data, genre='Drama')
drama_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
horror_budget_info = current_decade_budget_sizes(data=data, genre='Horror')
horror_budget_info
Most produced budget
Most profitable budgets
Least profitable budget
Conclusions
thriller_suspense_budget_info = current_decade_budget_sizes(data=data, genre='Thriller|Suspense')
thriller_suspense_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
plot_boxplot(data=data, genres=genres, title='Profit By Genre, Current Decade', columns=profit_columns, starting_year=2010, y_label='Hundreds of Millions')
The boxplot helps shed some light on each genre's strengths and weaknesses.
The horizontal black line represents the point where a movie breaks even domestically.
The yellow line in each box is the median profit/loss value for that genre at the domestic box office this decade.
The two T-shapes (one above and one below each box, also called "whiskers") mark the highest and lowest values in each genre this decade that we wouldn't consider unusual.
The circles on the outside of these whiskers represent outliers. These are either mega-hits or mega-bombs at the domestic box office.
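To make "normal" and "outlier" concrete: with matplotlib's default settings, a whisker extends to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box edge, and anything beyond that is drawn as a circle. The sketch below reproduces that rule on made-up profit figures; it assumes the default whisker setting was used, and the function name and data are hypothetical.

```python
import numpy as np

def whisker_bounds(values, whis=1.5):
    """Return (low whisker, high whisker, outliers) using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low_limit, high_limit = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the most extreme points that are still inside the limits
    inside = values[(values >= low_limit) & (values <= high_limit)]
    outliers = values[(values < low_limit) | (values > high_limit)]
    return inside.min(), inside.max(), outliers

# Hypothetical domestic profits in millions, with one mega-hit
profits = [-90, -60, -40, -30, -20, -10, 0, 15, 400]
low, high, outliers = whisker_bounds(profits)
print(low, high, outliers)
```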
The average movie in any genre is not domestically profitable
Action and Adventure
Comedy, Drama, Horror, and Thriller/Suspense
Recommendation
Final step
We will now dive into the performance of movies by their release week in the calendar year (weeks 1 to 53, since some years stretch into a 53rd week).
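As an illustration of where a 53rd week comes from, the sketch below uses ISO week numbering from Python's standard library. How the notebook's release_week column was actually derived is not shown in this section, so ISO numbering is an assumption here, and the helper name is hypothetical.

```python
from datetime import date

def iso_release_week(release_date):
    """Return the ISO calendar week number (1 to 53) for a release date."""
    return release_date.isocalendar()[1]

# Most years have 52 ISO weeks, but some (such as 2015) have 53
print(iso_release_week(date(2018, 7, 4)))
print(iso_release_week(date(2015, 12, 31)))
# Late-December dates can also belong to week 1 of the following ISO year
print(iso_release_week(date(2018, 12, 31)))
```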
figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
figure.suptitle('Number of Movies Released By Week\n(Seasons Are Color-Coded)', fontsize=20, y=1.05)
grp = data.groupby('release_week')[genres].sum()
grp.plot(kind='bar', stacked=True, ax=axis)
axis.set_ylabel('Count', fontsize=20)
axis.set_xlabel('')
axis.tick_params(labelsize=20)
axis.legend(genres, fontsize=20)
# Subtract one from axvspan ranges to account for it being a bar chart and not a line chart (e.g. Spring is weeks 9-22)
axis.axvspan(8, 21, alpha=0.1, facecolor='pink')
axis.axvspan(21, 34, alpha=0.1, facecolor='yellow')
axis.axvspan(34, 47, alpha=0.1, facecolor='orange')
axis.axvspan(47, 52, alpha=0.1, facecolor='green')
axis.axvspan(0, 8, alpha=0.1, facecolor='green')
plt.tight_layout()
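The shaded bands divide the year into four seasons of roughly 13 weeks each. A sketch of the same split as a lookup function is below. The exact boundaries are an approximation read off the axvspan ranges, the season names are inferred from the code comment that Spring is weeks 9-22 and from the order of the bands, and the function name is hypothetical.

```python
def season_for_week(week):
    """Map a release week (1 to 53) to a season, using approximate 13-week bands."""
    if not 1 <= week <= 53:
        raise ValueError('week must be between 1 and 53')
    if 9 <= week <= 21:
        return 'spring'
    if 22 <= week <= 34:
        return 'summer'
    if 35 <= week <= 47:
        return 'fall'
    # Weeks 48-53 and 1-8 wrap around the new year
    return 'winter'

for week in (1, 9, 22, 35, 48, 53):
    print(week, season_for_week(week))
```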
Other than a few weeks, a healthy number of movies has been released in every week of the year.
Let's subdivide by genre to get a better look.
def num_movies_released_by_release_week_by_genre(data, title, starting_year=1970, genres=genres, colors=colors):
    figure, axes = plt.subplots(nrows=6, ncols=1, figsize=(24, 16), sharex=True, sharey=True)
    figure.suptitle(title, fontsize=20, y=1.02)
    for genre, axis, color in zip(genres, axes.flat, colors):
        grp = data[(data['release_year'] >= starting_year) & (data[genre])].groupby('release_week')['title'].count()
        # If the series is missing a week, add it as an index
        # Then set the value to 0
        for week in range(1, 54):
            if week not in grp.index:
                grp.loc[week] = 0
        grp.sort_index(inplace=True, ascending=True)
        grp.plot(kind='bar', xticks=range(1, 54), ax=axis, linewidth=3, color=color)
        axis.set_ylabel('Count', fontsize=12)
        axis.set_xlabel('')
        axis.legend([genre], loc=2, fontsize=15)
        # Subtract one from axvspan ranges to account for it being a bar chart and not a line chart
        axis.axvspan(8, 21, alpha=0.1, facecolor='pink')
        axis.axvspan(21, 34, alpha=0.1, facecolor='yellow')
        axis.axvspan(34, 47, alpha=0.1, facecolor='orange')
        axis.axvspan(47, 52, alpha=0.1, facecolor='green')
        axis.axvspan(0, 8, alpha=0.1, facecolor='green')
    plt.tight_layout()
num_movies_released_by_release_week_by_genre(data=data, title='Number of Movies Released By Release Week',
starting_year=1970, genres=genres, colors=colors)
Comedy has been released in good numbers in practically every week.
Drama is released the most in Fall and Winter.
Action and Adventure are released the most in Summer.
Horror and Thriller/Suspense don't really have clear patterns.
num_movies_released_by_release_week_by_genre(data=data, title='Number of Movies Released By Release Week, This Decade',
starting_year=2010, genres=genres, colors=colors)
Comedy is released consistently in more weeks than any other genre.
Drama is still weighted towards Fall and Winter weeks.
Action is concentrated on Summer releases.
Adventure, Horror, and Thriller/Suspense have less clear patterns.
# Custom function to graph a fill_between line graph of a stat's performance by release week in two ways: all-time and the current decade
def fill_between_by_release_week(data, title, stat, genres=genres, colors=colors, y_label='Millions'):
    figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
    figure.suptitle(title, fontsize=20, y=1.05)
    for genre, axis, color in zip(genres, axes.flat, colors):
        grp1 = data[(data['release_year'] < 2010) & (data[genre])].groupby('release_week')[stat].median() / 1000000
        grp2 = data[(data['release_year'] >= 2010) & (data[genre])].groupby('release_week')[stat].median() / 1000000
        # If either series is missing a week, add it as an index and set the value to 0
        for week in range(1, 54, 1):
            if week not in grp1.index:
                grp1.loc[week] = 0
            if week not in grp2.index:
                grp2.loc[week] = 0
        # Sort the series by its index to have the weeks in chronological order
        grp1.sort_index(ascending=True, inplace=True)
        grp2.sort_index(ascending=True, inplace=True)
        axis.plot(range(1,54), grp1, color=colors[0], label='1970-2009')
        axis.plot(range(1,54), grp2, color=colors[1], label='This Decade')
        axis.fill_between(range(1, 54), y1=grp1, y2=grp2, where=grp2 <= grp1, facecolor=colors[0], interpolate=True, edgecolor='k')
        axis.fill_between(range(1, 54), y1=grp1, y2=grp2, where=grp2 > grp1, facecolor=colors[1], interpolate=True, edgecolor='k')
        axis.set_title(genre, fontsize=20)
        axis.set_ylabel(y_label, fontsize=12)
        axis.set_xlabel('')
        axis.legend(loc=2, fontsize=15)
        axis.axvspan(9, 22, alpha=0.2, color='pink')
        axis.axvspan(22, 35, alpha=0.2, color='yellow')
        axis.axvspan(35, 48, alpha=0.2, color='orange')
        axis.axvspan(48, 53, alpha=0.2, color='green')
        axis.axvspan(1, 9, alpha=0.2, color='green')
    plt.tight_layout()
fill_between_by_release_week(data=data, title='Median Gross By Release Week\n(Seasons Are Color-Coded)',
stat='domestic_adj', genres=genres, colors=colors, y_label='Millions')
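The loop that pads missing release weeks with zeros can also be written with pandas' `Series.reindex`. Here is a small sketch on made-up numbers. The toy `movies` frame is an assumption for illustration, not the notebook's data; only the column names are borrowed from it.

```python
import pandas as pd

# Toy data: four invented movies spread over three release weeks
movies = pd.DataFrame({
    'release_week': [1, 1, 3, 53],
    'domestic_adj': [10_000_000, 30_000_000, 50_000_000, 8_000_000],
})

# Median gross per release week, in millions
median_by_week = movies.groupby('release_week')['domestic_adj'].median() / 1_000_000

# Reindex onto all 53 possible weeks; weeks with no movies become 0
padded = median_by_week.reindex(range(1, 54), fill_value=0)

print(len(padded))
print(padded.loc[[1, 2, 3, 53]].tolist())
```

One thing to keep in mind with either approach: a padded 0 means "no movies released that week", not "movies that grossed nothing", so dips to zero in the charts should be read as gaps in the schedule.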
Action
Adventure
Comedy and Drama
Horror
Thriller/Suspense
fill_between_by_release_week(data=data, title='Median Budget By Release Week\n(Seasons Are Color-Coded)',
stat='budget_adj', genres=genres, colors=colors, y_label='Millions')
fill_between_by_release_week(data=data, title='Median Profit By Release Week\n(Seasons Are Color-Coded)',
stat='profit', genres=genres, colors=colors, y_label='Millions')
release_weeks_with_no_movies_all_time = [0] * 6
counter = [0, 1, 2, 3, 4, 5]
figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
figure.suptitle('Percentage of Movies That Breakeven By Release Week', fontsize=20, y=1.02)
for genre, axis, color, count in zip(genres, axes.flat, colors, counter):
    # Percentage of this genre's movies that broke even domestically, by release week
    grp = data[data[genre]].groupby('release_week')['domestic_breakeven'].mean() * 100
    # If the series is missing a release week, add it with a value of 0,
    # mark it with a white line, and count it as a week with no releases
    for week in range(1, 54):
        if week not in grp.index:
            grp.loc[week] = 0
            axis.axvline(week - 1, color='white', linewidth=2)
            release_weeks_with_no_movies_all_time[count] += 1
    # Sort the series by its index to have the release weeks in order
    grp.sort_index(ascending=True, inplace=True)
    grp.plot(kind='bar', xticks=range(1, 54), ax=axis, linewidth=3, color=color)
    axis.set_ylabel('Percentage', fontsize=12)
    axis.legend([genre], loc=2, fontsize=15)
    # Show 50% breakeven line
    axis.axhline(50, color='k', linewidth=1)
    # Color-code the seasons
    axis.axvspan(9, 22, alpha=0.2, color='pink')
    axis.axvspan(22, 35, alpha=0.2, color='yellow')
    axis.axvspan(35, 48, alpha=0.2, color='orange')
    axis.axvspan(48, 53, alpha=0.2, color='green')
    axis.axvspan(1, 9, alpha=0.2, color='green')
plt.tight_layout()
There don't appear to be any clear patterns with respect to release week and a movie's chance to break even.
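The percentages in these charts come from taking the mean of a True/False column. Since True counts as 1 and False as 0, the mean is the share of movies that broke even. Here is a toy sketch of that calculation. The values are invented for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Six invented movies in two release weeks
toy = pd.DataFrame({
    'release_week': [10, 10, 10, 10, 25, 25],
    'domestic_breakeven': [True, False, True, True, False, False],
})

grouped = toy.groupby('release_week')['domestic_breakeven']

# Mean of a boolean column = fraction of True values
pct = grouped.mean() * 100

# Number of movies behind each percentage
counts = grouped.size()

print(pct.to_dict())
print(counts.to_dict())
```

Looking at the counts next to the percentages is worthwhile here, because a release week with only a handful of movies can swing from 0% to 100% on a single title.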
release_weeks_with_no_movies_this_decade = [0] * 6
counter = [0, 1, 2, 3, 4, 5]
figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
figure.suptitle('Percentage of Movies That Breakeven By Release Week, This Decade', fontsize=20, y=1.02)
for genre, axis, color, count in zip(genres, axes.flat, colors, counter):
    # Percentage of this genre's movies released since 2010 that broke even domestically, by release week
    grp = data[(data['release_year'] >= 2010) & (data[genre])].groupby('release_week')['domestic_breakeven'].mean() * 100
    # If the series is missing a release week, add it with a value of 0,
    # mark it with a white line, and count it as a week with no releases
    for week in range(1, 54):
        if week not in grp.index:
            grp.loc[week] = 0
            axis.axvline(week - 1, color='white', linewidth=3)
            release_weeks_with_no_movies_this_decade[count] += 1
    # Sort the series by its index to have the release weeks in order
    grp.sort_index(ascending=True, inplace=True)
    grp.plot(kind='bar', xticks=range(1, 54), ax=axis, linewidth=3, color=color)
    axis.set_ylabel('Percentage', fontsize=12)
    axis.legend([genre], loc=2, fontsize=15)
    # Show 50% breakeven line
    axis.axhline(50, color='k', linewidth=1)
    # Color-code the seasons
    axis.axvspan(9, 22, alpha=0.2, color='pink')
    axis.axvspan(22, 35, alpha=0.2, color='yellow')
    axis.axvspan(35, 48, alpha=0.2, color='orange')
    axis.axvspan(48, 53, alpha=0.2, color='green')
    axis.axvspan(1, 9, alpha=0.2, color='green')
plt.tight_layout()
release_week = pd.DataFrame({'all_time': release_weeks_with_no_movies_all_time, 'this_decade': release_weeks_with_no_movies_this_decade}, index=genres)
release_week.sort_values(by='all_time', ascending=False, inplace=True)
color_list = generate_color_list(colors_needed=2, order_list=release_week.index)
figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
figure.suptitle('Number of Release Weeks Where No Movies Have Been Released, By Genre', fontsize=20, y=1.05)
release_week.plot(kind='bar', ax=axis, color=color_list)
axis.set_ylabel('Number of Release Weeks', fontsize=20)
axis.tick_params(labelsize=20)
axis.legend(fontsize=20)
autolabel(axis, fontsize=14)
plt.tight_layout()
sorted_genres = sorted([{'genre': genre, 'amount': (summary['breakeven_percentage'][genre])} for genre in genres], key=lambda k: k['amount'], reverse=True)
genres_list = [item['genre'] for item in sorted_genres]
color_list = generate_color_list(colors_needed=2, order_list=genres_list)
figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
figure.suptitle('Breakeven Percentage All-Time Versus This Decade', fontsize=20, y=1.05)
summary.reindex(genres_list).plot(y=['breakeven_percentage', 'current_decade_breakeven_percentage'], kind='bar', color=color_list, ax=axis)
axis.set_ylabel('Percentage', fontsize=20)
axis.tick_params(labelsize=20)
axis.legend(['All-Time', 'Current Decade'], fontsize=20)
autolabel(axis, fontsize=14)
plt.tight_layout()
plot_by_time_and_stat(data=data, genres=genres, title='Breakeven Percentage By Decade',
groupby_column='release_decade', stat_columns=breakeven_columns,
aggregate_function='mean', apply_needed=True, apply_function=lambda x: x * 100,
y_label='Percentage', y_ticks_needed=False, y_ticks='', legend_needed=True,
legend_text=genres, color=colors, axhline_needed=True, axhline_value=50, autolabel_needed=True, autolabel_fontsize=14)
Current Decade
Safest
Highest potential return per movie
Most calendar-friendly
Then Horror, Horror, Horror.
There's a pretty good reason Blumhouse is doing so well. It makes high-quality movies that are inexpensive to produce. It's basically impossible to do that with Action or Adventure movies, but it can be done with Horror. Other studios could mimic Blumhouse's business model with the least expensive genres.
Median Budgets This Decade
How many movies could we make for the same price as a typical Action or Adventure movie (not including marketing costs)?
Number of movies per one Action movie
Number of movies per one Adventure movie
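The comparison here is a simple ratio of median budgets. Here is a sketch with placeholder figures. The budgets below are invented for illustration and are not the notebook's medians, and `movies_per` is a hypothetical helper.

```python
# Hypothetical median budgets in millions of dollars (NOT the notebook's numbers)
median_budget_millions = {
    'Action': 100.0,
    'Adventure': 120.0,
    'Comedy': 40.0,
    'Horror': 20.0,
}


def movies_per(reference_genre, budgets):
    """Return how many movies of each other genre fit in one reference-genre budget."""
    reference = budgets[reference_genre]
    return {
        genre: round(reference / budget, 1)
        for genre, budget in budgets.items()
        if genre != reference_genre
    }


per_action = movies_per('Action', median_budget_millions)
per_adventure = movies_per('Adventure', median_budget_millions)
print(per_action)
print(per_adventure)
```

With the real median budgets swapped in, the same ratio answers the question for both Action and Adventure. As noted above, marketing costs are left out of this comparison.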
Then what's the problem? Why can't we make these low to mid budget movies at a fraction of the cost and make money on them?
The Streaming Problem
Median Grosses This Decade
Currently, Drama, Comedy, and Thriller/Suspense might be too expensive to release theatrically but cheap enough for streaming. Even though they have much smaller budgets, the marketing cost of a wide release is substantial unless you are great at viral marketing campaigns. Blumhouse is particularly good at getting the most for its marketing dollar.
Studios may be shifting a lot of low budget fare to streaming platforms, where they get predetermined fees for their content and save big on marketing dollars.
The writing seems to be on the wall. Action and Adventure are the only other genres that are doing well this decade. They tend to travel well, which means the explosion in the foreign box office market bodes well for them.
They are the most expensive genres to produce and market, but they are the big winners in terms of box office dollars.
Our dataset only includes revenue that movies generate from ticket sales, but that is only a slice of the movie revenue pie.
Stephen Follows has a great article detailing the revenue stream of movies nowadays. The following image comes from his article.

To summarize, the release windows are:
Many of these later release windows gain higher license fees if a movie is successful at the box office, making the theatrical window very important. On the other hand, theatrical isn't the only moneymaker, and movies can make up for lackluster box office with future revenue streams.
Here are some next steps to spruce up our analysis:
We have only scratched the surface in our analysis here, but our results provide very actionable insight. Genres that travel well (Action and Adventure) are earning the most these days, and Horror, due to its low cost and consistently good box office results, is a great genre to invest in.
plot_by_time_and_stat(data=data, genres=genres, title='Median Profitability By Decade',
groupby_column='release_decade', stat_columns=profit_columns,
aggregate_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
y_label='Millions', y_ticks_needed=False, y_ticks='', legend_needed=True,
legend_text=genres, color=colors, axhline_needed=True, axhline_value=50, autolabel_needed=True, autolabel_fontsize=14)
plot_by_time_and_stat(data=data, genres=genres, title='Numbers Released By Decade',
groupby_column='release_decade', stat_columns=genres, aggregate_function='sum',
apply_needed=False, apply_function=None, y_label='Number of Movies', y_ticks_needed=False, y_ticks='', autolabel_needed=True, autolabel_fontsize=14)